When designing a new API for a large project, developers need to make smart design choices so that their code base can grow sustainably. To ensure that new API components are well designed, developers can learn from existing API components. However, the lack of a standardized method for comparing API designs makes this learning process time-consuming and difficult. To address this gap we developed API-Spector, to the best of our knowledge one of the first API-to-API specification recommendation engines. API-Spector retrieves relevant specification components written in OpenAPI (a widely adopted language used to describe web APIs). API-Spector makes several significant contributions, including: (1) novel methods of processing and extracting key information from OpenAPI specifications, (2) innovative feature extraction techniques that are optimized for the highly technical API specification domain, and (3) a novel log-linear probabilistic model that combines multiple signals to retrieve relevant and high-quality OpenAPI specification components given a query specification. We evaluate API-Spector on both quantitative and qualitative tasks and achieve an overall 91.7% recall@1 and 56.2% F1, which surpasses baseline performance by 15.4% in recall@1 and 3.2% in F1. Overall, API-Spector will allow developers to retrieve relevant OpenAPI specification components from a public or internal database in the early stages of the API development cycle, so that they can learn from existing established examples and potentially identify redundancies in their work. It provides the guidance developers need to accelerate the development process and contribute thoughtfully designed APIs that promote code maintainability and quality.
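As a rough illustration of what a log-linear model that combines multiple signals can look like, the sketch below scores candidate specification components with p(c | q) proportional to exp(sum_i w_i * f_i(q, c)). The feature names, weights, and candidate values are hypothetical and are not taken from API-Spector; the abstract does not state which signals or weights the system uses.

```python
import math


def log_linear_scores(candidates, weights):
    """Return p(c | q) proportional to exp(sum_i w_i * f_i(q, c)) per candidate.

    `candidates` maps a candidate name to its feature values, and `weights`
    maps each feature name to its weight. All names here are hypothetical.
    """
    raw = {
        name: sum(weights[k] * feats[k] for k in weights)
        for name, feats in candidates.items()
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(raw.values())
    exps = {name: math.exp(score - top) for name, score in raw.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}


# Hypothetical signals: two similarity features and one quality feature.
weights = {"text_similarity": 2.0, "structure_similarity": 1.0, "quality": 0.5}
candidates = {
    "GET /users/{id}": {
        "text_similarity": 0.9,
        "structure_similarity": 0.8,
        "quality": 0.7,
    },
    "POST /orders": {
        "text_similarity": 0.2,
        "structure_similarity": 0.4,
        "quality": 0.9,
    },
}

probs = log_linear_scores(candidates, weights)
best = max(probs, key=probs.get)  # candidate with the highest probability
```

Because the model is linear in log space, each weight states directly how much one signal contributes relative to the others, which is one reason such models are convenient for combining heterogeneous relevance and quality features.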
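Machine learning on source code (MLOnCode) promises to transform how software is delivered. By mining the context and relationships between software artefacts, MLOnCode augments software developers' capabilities with code auto-generation, code recommendation, code auto-tagging, and other data-driven enhancements. For many of these tasks a script-level representation of code is sufficient; in many cases, however, a repository-level representation that takes into account various dependencies and the repository structure is essential, for example when auto-tagging repositories with topics or auto-documenting repository code. Existing methods for computing repository-level representations suffer from (a) reliance on natural language documentation of code (for example, README files) and (b) naive aggregation of method- or script-level representations, for example by concatenation or averaging. This paper introduces Topical, a deep neural network that generates repository-level embeddings of publicly available GitHub code repositories directly from source code. Topical incorporates an attention mechanism that projects the source code, the full dependency graph, and the script-level textual information into a dense repository-level representation. To compute the repository-level representations, Topical is trained to predict the topics associated with a repository, on a dataset of publicly available GitHub repositories that were crawled together with their ground-truth topic tags. Our experiments show that the embeddings computed by Topical outperform multiple baselines on the task of repository auto-tagging, including baselines that naively combine method-level representations by averaging or concatenation.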
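Machine Learning on Source Code (MLOnCode) is a popular research field driven by the availability of large-scale code repositories and the development of powerful probabilistic and deep learning models for mining source code. Code-to-code recommendation is a task in MLOnCode that aims to recommend relevant, diverse, and concise code snippets that usefully extend the code a developer is currently writing in their development environment (IDE). Code-to-code recommendation engines hold the promise of increasing developer productivity by reducing context switching away from the IDE and increasing code reuse. Existing code-to-code recommendation engines do not scale gracefully to large codebases, exhibiting linear growth in query time as the code repository increases in size. In addition, existing engines fail to account for the global statistics of code repositories in the ranking function, such as the distribution of code snippet lengths, leading to sub-optimal retrieval results. We address both of these weaknesses with Senatus, a new code-to-code recommendation engine. At the core of Senatus is De-Skew LSH, a new locality-sensitive hashing (LSH) algorithm that indexes the data for fast (sub-linear time) retrieval while also counteracting the skew in the snippet length distribution using novel abstract syntax tree-based feature scoring and selection algorithms. We evaluate Senatus via automatic evaluation and an expert developer user study, and find that its recommendations are of higher quality than those of competing baselines while achieving faster search. For example, on the CodeSearchNet dataset we show that Senatus improves performance by 6.7% F1 and is 16x faster in query time compared to Facebook Aroma on the task of code-to-code recommendation.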
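To make the idea of sub-linear retrieval with locality-sensitive hashing more concrete, here is a minimal sketch of a generic MinHash index with banding. It is not De-Skew LSH and does not include the length de-skewing or AST-based feature selection described in the abstract; the token sets, the constants, and names such as `LSHIndex` are hypothetical choices made for illustration.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 32              # length of each MinHash signature (assumed value)
BANDS = 8                    # number of bands the signature is split into
ROWS = NUM_HASHES // BANDS   # signature rows per band


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens):
    """MinHash signature of a non-empty collection of tokens."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(NUM_HASHES)]


def band_keys(signature):
    """Split a signature into (band index, band contents) bucket keys."""
    return [
        (b, tuple(signature[b * ROWS:(b + 1) * ROWS])) for b in range(BANDS)
    ]


class LSHIndex:
    """Toy index: snippets sharing at least one band land in the same bucket."""

    def __init__(self):
        self.buckets = defaultdict(set)

    def add(self, snippet_id, tokens):
        for key in band_keys(minhash_signature(tokens)):
            self.buckets[key].add(snippet_id)

    def query(self, tokens):
        # Only the query's own buckets are inspected, not the whole index.
        found = set()
        for key in band_keys(minhash_signature(tokens)):
            found |= self.buckets.get(key, set())
        return found


index = LSHIndex()
index.add("read_file", {"open", "read", "close", "path"})
index.add("http_get", {"requests", "get", "url", "json"})
hits = index.query({"open", "read", "close", "path"})
```

A query touches only the buckets its own signature maps to, so lookup cost depends on bucket sizes rather than on the total number of indexed snippets. Candidates returned this way would normally be re-ranked with an exact similarity measure.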
In today's data-driven society, supervised machine learning is rapidly evolving, and the need for labeled data is increasing. However, the process of acquiring labels is often expensive and tedious. For this reason, we developed ALANNO, an open-source annotation system for NLP tasks powered by active learning. We focus on the practical challenges in deploying active learning systems and try to find solutions to make active learning effective in real-world applications. We support the system with a wealth of active learning methods and underlying machine learning models. In addition, we leave open the possibility to add new methods, which makes the platform useful for both high-quality data annotation and research purposes.
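Improving the quality of search results can significantly enhance users' experience and engagement with search engines. Despite recent advances in machine learning and data mining, correctly classifying items for a particular user search query has been a long-standing challenge that still leaves considerable room for improvement. This paper introduces the "Shopping Queries Dataset", a large dataset of Amazon search queries and results, intended to foster research on improving the quality of search results. The dataset contains around 130 thousand unique queries and 2.6 million manually labeled (query, product) relevance judgements. The dataset is multilingual, with queries in English, Japanese, and Spanish. The Shopping Queries Dataset is being used in one of the KDDCup'22 challenges. In this paper, we describe the dataset and present three evaluation tasks along with baseline results: (i) ranking the results list, (ii) classifying product results into relevance categories, and (iii) identifying substitute products for a given query. We anticipate that this data will become the gold standard for future research on product search.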
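We consider the problem of detecting, in the visual sensing data stream of an autonomous mobile robot, semantic patterns that are unusual (i.e., anomalous) with respect to the robot's previous experience in similar environments. These anomalies might indicate unforeseen hazards and, in scenarios where failure is costly, can be used to trigger an avoidance behavior. We contribute three novel image-based datasets acquired in robot exploration scenarios, comprising more than 200k labeled frames and spanning various types of anomalies. On these datasets, we study the performance of an anomaly detection approach based on autoencoders operating at different scales.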